ABSTRACT



A system for minimizing congestion in a communication 
system is disclosed. The system comprises at least one 
ingress system for providing data. The ingress system 
includes a first free queue and a first flow queue. The system 
also includes a first congestion adjustment module for 
receiving congestion indications from the free queue and the 
flow queue. The first congestion adjustment module
generates and stores transmit probabilities and performs per
packet flow control actions. The system further includes a 
switch fabric for receiving data from the ingress system and 
for providing a congestion indication to the ingress system. 
The system further includes at least one egress system for 
receiving the data from the switch fabric. The egress system 
includes a second free queue and a second flow queue. The 
system also includes a second congestion adjustment mod- 
ule for receiving congestion indications from the second free 
queue and the second flow queue. The second congestion 
adjustment module generates and stores transmit probabili- 
ties and performs per packet flow control actions. Finally, 
the system includes a scheduler for determining the order 
and timing of transmission of packets out the egress system 
and to another node or destination. A method and system in 
accordance with the present invention provides for a unified 
method and system for logical connection of congestion 
with the appropriate flow control responses. The method and 
system utilizes congestion indicators within the ingress 
system, egress system, and the switch fabric in conjunction 
with a coarse adjustment system and fine adjustment system 
within the ingress device and the egress device to intelli- 
gently manage the system. 

19 Claims, 9 Drawing Sheets 
METHOD AND SYSTEM FOR MANAGING
CONGESTION IN A NETWORK 

FIELD OF THE INVENTION 

The present invention relates to computer networks and 
more particularly to a method and system for managing 
congestion in a processing system. 

BACKGROUND OF THE INVENTION 

In communications systems, it is common to reserve 
bandwidth for high priority traffic that is then transmitted in 
preference to lower priority traffic. Such lower priority 
traffic therefore must be managed to take advantage of the 
bandwidth remaining after higher priority traffic that is 
conformant to a contract has been served. This remaining 
bandwidth can vary widely depending on the activity of the 
high priority traffic. It is therefore of considerable impor- 
tance to manage the low priority traffic so as to optimize the 
use of the widely varying available bandwidth in the 
network, and, at the same time, avoid congestion in the 
network which reduces network throughput. 

It has become common to utilize window-based flow 
control mechanisms to avoid congestion in a TCP/IP packet 
communications network. Such window-based mechanisms 
pre-allocate receiver buffer credits to sources and notify the 
corresponding sender how much data can be sent. Upon 
detection of congestion, either at an egress port (if the 
receiver is an intermediate node) or within a node, the 
receiver withholds buffer credits, forcing the sending partner 
to slow down the launching of packets or to stop transmis- 
sion altogether. This process, also known as "back pressure" 
congestion control, is repeated hop by hop, eventually 
reaching the sources of traffic causing the congestion and 
forcing those sources to slow down. 

Such window-based, backpressure mechanisms perform 
efficiently with low speed networks with reasonably high bit 
error rates. As networks move toward higher transmission 
speeds and more reliable transmission media such as optical 
fibers, the window-based mechanisms no longer perform 
adequately. The cost of such hop-by-hop mechanisms 
becomes prohibitively expensive and inefficient due to the 
fact that a sender can send an entire window's worth of data 
and be required to wait for the receipt of new buffer credits 
from the receiver before continuing. The resulting silent 
period is at least as long as two propagation delays and 
results in a direct loss of throughput during this silent 
interval. Furthermore, the window-based flow control does 
not smooth the transmission of data into the network and 
hence causes large oscillations in loading due to the clus- 
tering of packets, further degrading network performance. 
Using larger windows merely worsens the silent period 
throughput degradation. 

In order to better accommodate modern high-speed and 
reliable packet communications networks, it has been pro- 
posed to use an end-to-end congestion control mechanism 
which relies on the regular transmission of sample packets 
having time stamps included therein. One such mechanism 
is disclosed in "Adaptive Admission Congestion Control,"
by Z. Haas, ACM SIG-COMM Computer Communications
Review, Vol. 21(5), pages 58-76, October 1991. In the Haas
article, successive time-stamped sample packets are used to
calculate changes in network delays that are averaged to
represent the state of the network. The averaged network
delay is then used to control the admission of packets to the
network. That is, the admission rate becomes a function of
congestion measurements, either by controlling the
inter-packet gap directly, or by adjusting the token rate in a
standard leaky bucket scheme at the admission point.

One disadvantage of the Haas end-to-end congestion
control mechanism is that Haas sends sampling packets at 
regular intervals regardless of the traffic load from a sender. 
Sending such sampling packets when the sender is idle is 
wasted effort and reduces the good throughput of the system.
Furthermore, Haas must await the arrival of a plurality of
sampling packets before initiating congestion control, thus
providing too slow a response time to permit flow control as
well as congestion control.

Another disadvantage of the Haas scheme is the so-called
"accumulation effect". If the length of queues along the
congestion path is built up gradually by small amounts, the 
overall delay can exceed the threshold allowed for the 
overall connection without being detected by the Haas 
endpoint detection scheme. The network can therefore
become congested without timely correction when using the
Haas congestion control scheme. 

Yet another disadvantage of the Haas congestion control 
scheme is the fact that the inter-packet control gap is used to
control the input packet rate. Sources of short packets are
therefore penalized unfairly compared to sources of long 
packets when the inter-packet gap control technique of Haas 
is used to control congestion. Finally, and most importantly, 
the Haas congestion control scheme requires relatively
frequent transmission of sampling packets to provide timely
control information. Indeed, the overhead for such sampling 
packets can reach up to twenty percent of the entire through- 
put of the network, making the Haas congestion control 
scheme provide a lower throughput than an uncontrolled
network when the traffic load is less than eighty percent. If
the transmission rate of Haas' sampling packets were to be 
reduced to approximate the round trip delay period, on the 
other hand, the scheme simply would not work at all due to 
the paucity of control information available at the sender.
That is, the averaging step used to reduce the noise in the
control signal would make the scheme so unresponsive to 
the congestion to be controlled that the low sampling rate 
would be unable to control the congestion. 

U.S. Pat. No. 5,367,523, issued to Chong et al. and assigned
to the assignee of the present application, addresses some of the
problems associated with Haas. This patent discloses an
end-to-end, closed loop flow and congestion control system
for packet communications networks. It exchanges rate
request and rate response messages between data senders
and receivers to allow the sender to adjust the data rate to
avoid congestion and to control the data flow. Requests and 
responses are piggybacked on data packets and result in 
changes in the input data rate to optimize data throughput. 
GREEN, YELLOW and RED operating modes are defined
to increase data input, reduce data input and reduce data
input drastically, respectively. Incremental changes in data
input are altered non-linearly to change more quickly when
further away from the optimum operating point than when
closer to the optimum operating point.

Although this system operates effectively for its stated
purpose, it allows neither for prioritizing of packets nor for 
viewing congestion at various levels of granularity. 
Accordingly, what is needed is a system and method that
controls congestion in a network in a manner that enables a
response to congestion in each part of the system both
locally and in the context of the overall system performance. 
The method and system should be easily implemented in 
existing networks and should be cost effective. The present
invention addresses such a need.

SUMMARY OF THE INVENTION

A system for minimizing congestion in a communication
system is disclosed. The system comprises at least one
ingress system for providing data. The ingress system
includes a first free queue and a first flow queue. The system
also includes a first congestion adjustment module for
receiving congestion indications from the free queue and the
flow queue. The first congestion adjustment module
generates and stores transmit probabilities and performs per
packet flow control actions. The system further includes a
switch fabric for receiving data from the ingress system and
for providing a congestion indication to the ingress system.
The system further includes at least one egress system for
receiving the data from the switch fabric. The egress system
includes a second free queue and a second flow queue. The
system also includes a second congestion adjustment module
for receiving congestion indications from the second
free queue and the second flow queue. The second congestion
adjustment module generates and stores transmit
probabilities and performs per packet flow control actions.
Finally, the system includes a scheduler for determining the
order and timing of transmission of packets out the egress
system and to another node or destination.

A method and system in accordance with the present
invention provides for a unified method and system for
logical connection of congestion with the appropriate flow
control responses. The method and system utilizes congestion
indicators within the ingress system, egress system, and
the switch fabric in conjunction with a coarse adjustment
system and fine adjustment system within the ingress device
and the egress device to intelligently manage the system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a logical system for scalable processing
of data packets.

FIG. 2 is a block diagram of a system in accordance with
the present invention.

FIG. 3 is a block diagram illustrating an ingress flow
control system in accordance with the present invention.

FIG. 4 is a block diagram illustrating an egress flow
control system in accordance with the present invention.

FIG. 5 is a block diagram illustrating a switch fabric in
accordance with the present invention.

FIG. 6 is a block diagram of a per flow background update
block for ingress flow control in accordance with the present
invention.

FIG. 7 is a block diagram of a per packet action block for
ingress flow control in accordance with the present invention.

FIG. 8 is a block diagram of a per-flow background update
module for the egress flow control in accordance with the
present invention.

FIG. 9 is a block diagram of a per packet action module
for the egress flow control.

DETAILED DESCRIPTION

The present invention relates to networks and more
particularly to a method and system for minimizing congestion
in a processing system. The following description is
presented to enable one of ordinary skill in the art to make and
use the invention and is provided in the context of a patent
application and its requirements. Various modifications to
the preferred embodiment and the generic principles and
features described herein will be readily apparent to those
skilled in the art. Thus, the present invention is not intended
to be limited to the embodiment shown but is to be accorded
the widest scope consistent with the principles and features
described herein.

Every communication system faces the problem of flow
control for data packets. Congestion resulting from the flow
of the packets can arise in a variety of contexts, including
the convergence of several flows contending for a
shared classification or scheduling resource. Classification
decisions must be made to efficiently and effectively move
data through the system. FIG. 1 illustrates a logical system
for scalable processing of data packets. As is seen, the
system 10 shows an ingress system 12 and an egress system
16. The ingress and egress systems 12 and 16 transfer
packets via a switch fabric 14. Typically, in a multiprocessing
system there is a plurality of ingress systems 12 and
egress systems 16 that communicate simultaneously. As is
also seen, each of the ingress and egress systems 12 and 16
includes a free queue 20 and 34 respectively as well as a
plurality of flow queues 18a-18c and 30a-30c respectively.
Typically, there is also a scheduler 32 in the egress system
16. The flow queues 18a-18c and 30a-30c schedule and
momentarily store packets. The free queues 20 and 34 are for
memory management in each of the systems 12 and 16.

Accordingly, each of the systems 12, 14, and 16 can
experience congestion either within a particular system or
between different systems. What is needed is a system to
intelligently manage congestion.

A method and system in accordance with the present
invention provides for a unified method and system for
logical connection of congestion with the appropriate flow
control responses. The method and
system utilizes congestion indicators within the ingress
system, egress system, and the switch fabric in conjunction
with a coarse adjustment system and fine adjustment system
within the ingress device and the egress device to
intelligently manage the system. A system and method in
accordance with the present invention identifies a plurality of
logical tests or definitions of congestion. A response to the
congestion could be to discard all traffic of some types,
differentially change the rate of discard of different types of
traffic, or to remark priority information in a packet, such as
remarking a DiffServ code point. To describe these features
in more detail, refer now to the following description in
conjunction with the accompanying figures.

FIG. 2 is a block diagram of a system 100 in accordance
with the present invention. The system 100 includes similar
elements to those of FIG. 1, that is, an ingress system 102, an
egress system 106, and a switch fabric 104. However, these
systems are enhanced with congestion indicators, system
state mechanisms, and flow control action mechanisms.
Further, congestion information is shared between the
ingress, egress and switch fabric systems as illustrated.

FIG. 3 is a block diagram of the ingress system 102 in
accordance with the present invention. The ingress system
102 includes congestion indicators 107a-107c in each of the
flow queues 108a-108c and a congestion indicator 109 in its
free queue 110. In the preferred embodiment these congestion
indicators are the result of a comparison between a
programmable threshold and the current depth of the queue.
The ingress system 102 also includes a plurality of pipe bit rate
modules 124a-124c which include a plurality of congestion
indicators 125a-125c. The ingress system 102 also includes
a Per Flow Background Update module 114, and a Per
Packet Action module 116. The ingress system 102 includes
a memory 112 coupled thereto.

The basic logical tasks in the ingress system 102 are as
follows. As packets arrive at ingress ports, the ingress
processing includes storage of packets into memory 112, the
notification to a packet classification mechanism of the
identity of the packets in the memory 112, classification of
packets including determination of a transmit probability,
enqueueing into one of the flow queues 108a-108c, and
finally dequeuing into the switch fabric.

FIG. 4 is a block diagram of the egress system 106 in
accordance with the present invention. The egress system
106 includes a congestion indicator 129a-129c in each of
the flow queues 126a-126c, and a congestion indicator 131
in its free queue 128. The egress system 106 also includes a
plurality of bit rate modules 127a-127c which include a
plurality of congestion indicators 133a-133c. The egress
system 106 also includes a scheduler 134 for managing the
output of packets. The egress system also includes a Per
Flow Background Update module 130, and a Per Packet
Action module 132. The egress system 106 includes a
memory 136 coupled thereto. The basic logical tasks of the
egress system 106 comprise storage of packets arriving from
the switch fabric, notification to classification mechanisms
of the identity of packets in storage, calculation of transmit
probabilities, and dequeueing to the target ports.

FIG. 5 is a block diagram of the switch fabric 104 in
accordance with the present invention. The switch fabric 104
includes a global shared memory 120 along with a plurality
of flow queues 118a-118c. The flow queues 118a-118c each
include a congestion indicator 119a-119c.

Although a fixed number of elements are shown in the
figures, one of ordinary skill in the art readily recognizes that
any number could be utilized and that use would be within
the spirit and scope of the present invention. For example,
although three flow queues are shown, any number could be
utilized and they would be within the spirit and scope of the
present invention.

Measurement of congestion within the system is performed
both instantaneously and periodically. Referring back to
FIG. 2, as is seen, the free queues (global shared memory
resources) 110, 128 from both the ingress and egress systems
102, 106 provide congestion information to their
corresponding per packet action modules 116, 132. The
periodically measured information could be status of the free queue
relative to one or more thresholds or the raw occupancy of
the free queue. The pipe bit rate modules 124a-124c and
127a-127c also provide congestion information to the per
packet action modules 116, 132. Again, status relative to
thresholds or raw data could be used. The free queue 128 of
the egress system also provides congestion information to
the per packet action module 116 of the ingress
system 102.

The congestion indicator 151 of the global shared
memory 120 of the switch fabric 104 as well as the
congestion indicators in flow queues 118a-118c of the switch
fabric 104 act as a throttle for the ingress system, which will
be described in detail below.

An important feature of the present invention is the
adjustment of the data flow based upon the congestion
indicators within the ingress system 102 and the egress
system 106. There are two types of adjustments for congestion
made based upon the congestion indicators. The first
type is a coarse adjustment for congestion. This typically
relates to the overall congestion of the system. The second
type is a fine adjustment for congestion. This typically
relates to the congestion within a particular egress port
within the system. The coarse adjustment for congestion is
made by the per flow background update modules 114 and
130, and the fine adjustment for congestion is made by the
per packet action modules 116 and 132, respectively. To
describe the operation of the system in more detail, refer
now to the figures in conjunction with the following
discussion.

Ingress Flow Control

The objective of the ingress flow control mechanism is to
discard packets in an intelligent fashion when there is
congestion. The ingress flow control mechanism is activated
on an enqueue operation to one of the flow queues
108a-108c. As indicated above, the ingress system 102
receives several congestion indicators as input. Based on
these congestion indicators, based on programmable discard
probabilities, and based on a set of selectable algorithms like
random early discard or shock-absorber random early discard,
the ingress flow control mechanism determines if the
enqueue operation is successful or if the packet is discarded.

The flow control mechanism periodically inspects the
congestion indicators and calculates transmit probabilities
for all types of packets. The Per Flow Background Update
Module 114 and the Per Packet Action module 116 are
utilized to generate and store the transmit probabilities and
to perform the per packet flow control actions.

In addition, on the ingress system 102 there is a response
to congestion indicators provided from the switch fabric 104
from its flow queues, as well as from the global
shared memory 120, which indicate the probability for
congestion in the switch fabric 104. When these congestion
indications occur, the flow control action is to delay
transmission of packets from the flow queues 108a-108c to the
switch fabric until the congestion is no longer indicated.
This flow control aspect is necessary since the rate of data
transfer across the switch fabric 104 path typically is very
large, on the order of many gigabits per second, whereas the
path to the egress system 106 is much smaller, on the order
of a few gigabits per second. So, to the extent that the path
from the switch fabric 104 to the egress system is
congested, it is important that the overall system adjust.

FIG. 6 is a block diagram of a per-flow background update
module 114 for the coarse adjustment in the ingress system
102 in accordance with the present invention. The per flow
background update module 114 takes the congestion
indicator 109 from free queue 110, the congestion indicators
107a-107c from the flow queues 108a-108c, the congestion
indicator 131 from the egress free queue 128, as well as
parameters of the selected flow control algorithm and
generates a control response by means of a logical matrix of
transmit probabilities. Typically, the per flow background
update module 114 samples its inputs at a fixed period and
computes the control response. The selected flow control
algorithm's parameters define the size of the matrix (number
of transmit probabilities) and the packet classification
parameters used when selecting the appropriate transmit
probability from the matrix. An example would be to
provide different classes of service for different flows; as an
example one packet classification may have a class of
service definition that does not allow any discarding of
packets except in cases of severe congestion, while others
may permit discarding of packets at lower congestion levels.
Within the set of classes of service that permit discarding of
packets at lower congestion levels, there can be a hierarchy
of services that vary the probability of frame discard for a
congestion state. Further, the values of the transmit
probabilities are varied due to the congestion state of the system
at the time the response is calculated.

One output of the per-flow background update module
114 is a transmit probability matrix which is an average
desired transmission fraction for each class in each pipe; i.e.
the average fraction of all packets in a pipe to be transmitted.
The other packets are to be discarded. Typically, the
per-class, per-pipe transmission fractions are refreshed with a
period ranging in the interval 100 microseconds to 10
milliseconds by a Transmit Probability Engine 1144. In the
preferred embodiment, the Transmit Probability Engine is a
combination of hardware and software. The implementation
selection of the engine 1144 is a trade off between hardware
and software complexity and can be implemented as all
hardware or software.

A second output of the per flow background update
module, an overall indication of the activity and congestion
of the overall system, is created (Current System
Measurements). The Current System Measurements are then
provided to the per packet action module 116.

The key features of the Per Flow Background Update
module are:

1. Queue accounting blocks 1142.

2. A transmit probability engine 1144 which periodically
(every 10 us to 10 ms) calculates drop probabilities
based on factors described previously.

Ingress Queue Accounting

The queue accounting blocks 1142 maintain the following:

Free Queue Accounting

The following queue accounting mechanisms are preferably
utilized for the free queue 110 of the ingress system
102.

1. TotalCount. The TotalCount is decremented for each
buffer that is allocated during packet reception and it is
incremented for each buffer that is released during packet
transmission. This provides a count of the number of buffers
available in the ingress memory 112 used for the storage of
packet data.

2. Arrival rate (A). Arrival rate of data into the ingress
data store. This counter increments each time a buffer is
allocated from the free queue. It is periodically sampled to
determine the rate of arrival.

3. Exponentially weighted average of TotalCount
(ExpAvgTotCount). The weighted average is calculated
according to:

ExpAvgTotCount = (1 - K) * ExpAvgTotCount + K * TotalCount,

where this calculation is periodically
executed. K is programmable to have various values
including 1/8, 1/4, 1/2 and 1. Congestion of the ingress
system 102 is thus determined by an examination of the
above when compared against programmable thresholds
for each of these measurements.

Transmit Probability Engine 1144

The transmit probability engine 1144 is a program or
device or a combination of a program and device that is
periodically triggered by a timer within the ingress system
102. It takes the contents of the queue accounting blocks
1142, and parameters of the selected flow control algorithm
and calculates transmit probabilities for different
Traffic-Types and different congestion conditions. It writes the
results into a transmit probability memory 1164 found in the
Per Packet Action module 116.

The per packet action module 116 receives the Current
System Measurements from the per flow background update
module 114 as well as packet classification information. The
operation of the per packet action module 116 will be
described in detail below.

FIG. 7 is a block diagram of a per packet action module
116 in the ingress system 102 in accordance with the present
invention. Inputs are current system measurements and
packet classification information. Packet classification
information provides pipe membership which in turn provides
processing constants and per-pipe constants such as
minimum guaranteed bandwidth. Packet classification
information is utilized to determine on a per-packet basis the correct
response to the congestion information. Current system
measurements for example are free queue size, offered rate,
current per-pipe flow rates, excess bandwidth signal (used in
reference to the egress system), previous free queue size,
previous global transmit fraction, previous per-pipe transmit
fraction, and exponentially weighted average of previous
excess bandwidth signals.

The per packet action module 116 uses the packet
classification information to select which transmit fraction to
apply or what other action to apply.

The key features of the per packet action module are:

1. A transmit probability memory 1164, written by the
transmit probability engine 1144 and read by the
mechanism for transmitting or dropping packets.

2. A random number generator 1166, which generates a
transmit decision by comparison (using compare
function 1160) to the current transmit probability.

3. A transmit block 1168 which executes the transmit
decision based on the result of the algorithm combined
with packet classification information and the current
system measurements. In the preferred embodiment,
the transmit block also connects to a number of traffic
counters, to report the rates of transmitted and
discarded packets.

Transmit Probability Memory 1164

The transmit probability memory is preferably a plurality
of entries; the preferred embodiment contains 64 entries. In
the preferred embodiment, the transmit probability is
implemented as a 7 bit number indicating a fractional granularity
of 1/128. Selection of the entry is based on the Packet
Classification information, and the current system measurements.

Random Number Generator 1166, Compare 1160, and
Transmit Block 1168

The random number generator 1166 in a preferred
embodiment is a 32-bit free running random generator.
Seven or more bits are used as an input to the compare unit.
The output of the compare indicates discard when the
random number generated is greater than the transmit
probability.

Egress Flow Control

The objective of the egress flow control mechanism is to
discard packets in an intelligent fashion when there is
congestion. The egress flow control mechanism is activated
on entry to the egress system 106 and on an enqueue
operation to the scheduler 134. In the egress system 106, the
flow control mechanism takes several congestion indicators
as input, as described previously. Based on these congestion
indicators, based on programmable transmit probabilities
and based on a set of selectable algorithms like random early
discard or shock absorber random early discard, the flow 
control mechanism determines if the enqueue operation is 
successful or if the packet is discarded. 
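
By way of illustration only, the enqueue-or-discard decision can be sketched in Python as follows. The sketch assumes the 7-bit transmit probability and the compare rule stated for the per packet action module (discard when the random number is greater than the transmit probability). The function names, the choice of the low seven bits of the random word, and the use of a Python list as the queue are assumptions made for the example and are not taken from the specification.

```python
import random

PROB_BITS = 7                       # the text describes a 7-bit transmit probability
PROB_MASK = (1 << PROB_BITS) - 1    # 0x7F, i.e. values 0..127


def transmit_decision(transmit_prob: int, rand_word: int) -> bool:
    """Return True to transmit (enqueue), False to discard.

    transmit_prob: 7-bit transmit probability read from the probability memory.
    rand_word:     output of a free-running random generator (e.g. 32 bits).
    """
    if not 0 <= transmit_prob <= PROB_MASK:
        raise ValueError("transmit_prob must fit in 7 bits")
    # Assumption: the low seven bits feed the compare unit; the text only
    # says "seven or more bits" without naming which ones.
    draw = rand_word & PROB_MASK
    # Discard when the random number is greater than the transmit probability.
    return draw <= transmit_prob


def try_enqueue(packet, transmit_prob: int, queue: list, rng=random) -> bool:
    """Hypothetical enqueue wrapper: append the packet only if it survives."""
    if transmit_decision(transmit_prob, rng.getrandbits(32)):
        queue.append(packet)
        return True
    return False
```

Under this literal reading of the compare rule, a transmit probability of 127 never discards, while a transmit probability of 0 still passes a draw of 0.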

The key features for the egress flow control are similar to
the ingress flow control previously described. The key
differences are described below.

The first invocation in the egress system 106 of flow 
control is when a packet enters the system. When the 
memory 136 is severely congested as indicated by congestion
indicator 131, flow control will discard packets. Several
thresholds can be defined, with packet classification criteria
that allow discard of different classes of packets due to
different levels of severe congestion. This mechanism can be
used to protect critical traffic, such as control traffic, from
being blocked due to a failure in the flow control mechanism.
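
A minimal sketch of such per-class entry thresholds, in Python, is given below for illustration. The class names, the threshold values, and the convention that each threshold is the minimum number of free buffers a class needs in order to be admitted are all hypothetical; the specification states only that several thresholds can be defined so that different classes are discarded at different levels of severe congestion.

```python
# Hypothetical thresholds: minimum number of free buffers in the egress
# memory that must remain for a packet of the given class to be admitted.
# Critical control traffic is only refused when no buffer is left at all.
MIN_FREE_BUFFERS = {
    "best_effort": 256,
    "assured": 64,
    "control": 1,
}


def admit_on_entry(packet_class: str, free_buffers: int) -> bool:
    """Return True if a packet of packet_class may enter the egress system."""
    return free_buffers >= MIN_FREE_BUFFERS[packet_class]


if __name__ == "__main__":
    for free in (512, 100, 10, 0):
        admitted = [c for c in MIN_FREE_BUFFERS if admit_on_entry(c, free)]
        print(free, admitted)
```

With these example values, lower-priority classes are shed first as the free buffer count falls, and control traffic continues to be admitted until the memory is exhausted.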

The second invocation of flow control in the egress 
system 106 occurs when the packet is enqueued to the 
scheduler 134. Similar to ingress flow control, an important 
feature of the egress flow control is coarse and fine adjust- 
ments in response to congestion indications. 

As in the ingress system, the per flow background update
module 130 provides the coarse adjustments. The fine
adjustments of the egress system 106 are due to measurements and
congestion indications for the egress ports and the generation
of transmit probabilities for flows.

FIG. 8 is a block diagram of a per-flow background update 
module 130 or the coarse adjustment for the egress flow 
control in accordance with the present invention. The per 
flow background update module 130 takes the congestion 
indicator 131 from free queue 128, the congestion indicators 
133a-133c from the flow queues 127a-127c, as well as 
parameters of the selected flow control algorithm and 
generates a control response by means of a logical matrix of 
transmit probabilities. Typically, the per-flow background 
update module 130 samples its inputs at a fixed period and 
computes the control response. 

The selected flow control algorithm's parameters define 
the size of the matrix (number of transmit probabilities) and 
the packet classification parameters used when selecting the 
appropriate transmit probability from the matrix. An 
example would be to provide different classes of service for 
different flows: one packet classification may 
have a class of service definition that does not allow any 
discarding of packets except in cases of severe congestion, 
while others may permit discarding of packets at lower 
congestion levels. Within the set of classes of service that permit 
discarding of packets at lower congestion levels, there can 
be a hierarchy of services that vary the probability of frame 
discard for a congestion state. Further, the values of the 
transmit probabilities are varied according to the congestion state 
of the system at the time the response is calculated. 
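
The matrix of transmit probabilities can be sketched as a small lookup table indexed by class of service and congestion state. The class names, the four congestion states, and every probability value below are assumptions made for illustration only; the specification leaves the matrix size and contents to the selected algorithm's parameters.

```python
# Hypothetical congestion states, ordered from least to most congested.
CONGESTION_STATES = ("low", "medium", "high", "severe")

# Hypothetical matrix: one row per class of service, one column per state.
# "no_discard" drops packets only under severe congestion; "gold" and
# "bronze" form a hierarchy with increasing discard probability.
TRANSMIT_PROBABILITY = {
    "no_discard": (1.0, 1.0, 1.0, 0.5),
    "gold": (1.0, 0.9, 0.6, 0.2),
    "bronze": (0.9, 0.6, 0.3, 0.0),
}


def transmit_probability(service_class, congestion_state):
    """Select the transmit probability for a class in a congestion state."""
    column = CONGESTION_STATES.index(congestion_state)
    return TRANSMIT_PROBABILITY[service_class][column]
```

A per-packet decision would then look up the entry for the packet's class and the current congestion state before deciding whether to enqueue.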

One output of the per-flow background update module 
130 is a transmit probability matrix which is an average 
desired transmission fraction for each class in each pipe; i.e. 
the average fraction of all packets in a pipe to be transmitted. 
The other packets are to be discarded. Typically, the 
per-class, per-pipe transmission fractions are refreshed with a 
period ranging in the interval 100 microseconds to 10 
milliseconds by a Transmit Probability Engine 1304. In the 
preferred embodiment, the Transmit Probability Engine is a 
combination of hardware and software. The choice of 
implementation for the engine 1304 is a trade-off between hardware 
and software complexity; the engine can also be implemented 
entirely in hardware or entirely in software. 

A second output of the per flow background update 
module is an overall indication of the activity and congestion 
of the overall system (Current System 
Measurements). The Current System Measurements are then 
provided to the per packet action module 132. 

The key features of the Per Flow Background Update 
module are: 

1. Queue accounting blocks 1302. 

2. A transmit probability engine 1304 which periodically 
(every 10 us to 10 ms) calculates drop probabilities 
based on factors described previously. 

Egress Queue Accounting 

The Queue accounting blocks 1302 maintain the follow- 
ing: 

Free Queue Accounting 

The following queue accounting mechanisms are used for 
the egress free queue 128. 

1. TotalCount. The TotalCount is decremented for each 
buffer that is allocated during packet reception and it is 
incremented for each buffer that is released during packet 
transmission. This provides a count of the number of buffers 
available in the egress memory 136 used for the storage of 
packet data. 

2. Arrival rate (A). Arrival rate of data into the egress data 
store. This counter increments each time a buffer is allocated 
from the free queue. It is periodically sampled to determine 
the rate of arrival. 

3. Exponentially weighted average of TotalCount 
(ExpAvgTotCount). The weighted average is calculated 
according to: 

ExpAvgTotCount = (1-K)*ExpAvgTotCount + K*TotalCount, 

where this calculation is periodically executed. K is 
programmable to have various values including 1/8 and 1. 

Congestion of the egress system 106 is thus determined by an 
examination of the above when compared against programmable 
thresholds for each of these measurements. 
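
The three free queue measurements can be sketched together as follows. The class and method names are hypothetical, and two details are assumptions not stated in the text: the arrival counter is reset at each sample, and the weighted average starts at the initial buffer count.

```python
class FreeQueueAccounting:
    """Sketch of TotalCount, arrival rate and ExpAvgTotCount for a free queue."""

    def __init__(self, total_buffers, k=0.125):
        self.total_count = total_buffers  # buffers currently available
        self.arrivals = 0  # buffers allocated since the last sample
        self.exp_avg_tot_count = float(total_buffers)  # assumed starting value
        self.k = k  # programmable weight K

    def allocate(self):
        """A buffer is taken from the free queue during packet reception."""
        self.total_count -= 1
        self.arrivals += 1

    def release(self):
        """A buffer is returned to the free queue during packet transmission."""
        self.total_count += 1

    def sample(self, period_seconds):
        """Periodic update: return the arrival rate and refresh the average."""
        rate = self.arrivals / period_seconds
        self.arrivals = 0  # assumption: the counter restarts each period
        self.exp_avg_tot_count = (
            (1 - self.k) * self.exp_avg_tot_count + self.k * self.total_count
        )
        return rate


acc = FreeQueueAccounting(total_buffers=100, k=0.5)
for _ in range(10):
    acc.allocate()
rate = acc.sample(period_seconds=0.5)
```

With K equal to 1 the average simply tracks the latest TotalCount; smaller values of K smooth out short bursts.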
The scheduler 134 must prioritize traffic from the switch 
in an intelligent manner. In a preferred embodiment, the 
traffic is provided as priority 1 traffic (realtime traffic) 
and priority 0 traffic (non-realtime traffic). 

An accounting mechanism for priority 1 traffic includes 
the following counters: 

Priority 1 Counter (P1Count). Incremented by the number of 
buffers used by a priority 1 packet when a packet enters the 
scheduler 134 and decremented by the number of buffers 
used by a packet when a packet is transmitted after leaving 
the scheduler 134. 
Arrival rate (A). Arrival rate of priority 1 packets into the 
scheduler 134. This counter increments by the number 
of buffers in use by a packet each time a packet is 
enqueued into the scheduler 134. It is periodically 
sampled to determine the rate of arrival. 
Departure rate (D). Departure rate of priority 1 packets 
from the scheduler 134. This counter increments by the 
number of buffers in use by a packet each time a packet 
is removed from the scheduler 134 to be transmitted out 
an egress port. It is periodically sampled to determine 
the rate of departure. 

ExpAvgPri1Count. Exponentially weighted average of 
the Priority 1 counter, calculated according to: 

ExpAvgPri1Count = (1-K)*ExpAvgPri1Count + K*P1Count. 

This calculation is periodically executed (every 10 us to 
10 ms). K is programmable to have the values including 1/8, 
1/4, 1/2 and 1. 

An accounting mechanism for the priority 0 traffic 
includes the following counters: 

Priority 0 Counter (P0Count). Incremented by the number 
of buffers used by a priority 0 packet when a packet 
enters the scheduler 134 and decremented by the number 
of buffers used by a packet when a packet is 
transmitted after leaving the scheduler 134. 

ExpAvgPri0Count. Exponentially weighted average of 
the Priority 0 counter, calculated according to: 

ExpAvgPri0Count = (1-K)*ExpAvgPri0Count + K*P0Count. 

This calculation is periodically executed (every 10 us to 
10 ms). K is programmable to have the values including 
1/8, 1/4, 1/2 and 1. 

Port Queue Accounting 

In a preferred embodiment, a measurement of the number 
of buffers in use by all flows using an egress port is provided. 
A count for each priority, 0 and 1, for each egress port is 
maintained: 

PortCount. Incremented by the number of buffers used by 
a packet destined for this target port when a packet 
enters the scheduler 134 and decremented by the number 
of buffers used by a packet when a packet is 
transmitted. I.e., this counter counts the total number of 
buffers consumed by packets destined for a given target 
port and priority. Sampling of this counter allows the 
system to determine if excess bandwidth is available at 
this target port. For example, if the PortCount is 
sampled and is found to be non-zero and decreasing, 
then all the available bandwidth is not utilized. 

Flow Queue Accounting 

For each flow queue the following counters are maintained: 

A buffer count is maintained which is incremented by the 
number of buffers in use by the packet during enqueue 
into the flow queue. The buffer count is decremented 
during dequeue from the flow queue. 

Arrival rate (A). Arrival rate of packets into the flow 
queue. This counter increments by the number of 
buffers in use by a packet each time a packet is 
enqueued into a flow queue. It is periodically sampled 
to determine the rate of arrival. 

Congestion of the egress system 106 is thus determined 
by an examination of the above when compared against 
programmable thresholds for each of these measurements. 

Transmit Probability Engine 1304 

The transmit probability engine 1304 is a program or 
device or a combination of a program and device that is 
periodically triggered by a timer within the egress system 
106. It takes the contents of the queue accounting blocks 
1302, and parameters of the selected flow control algorithm 
and calculates transmit probabilities for different 
TrafficTypes, different flow queues and different congestion 
conditions. It writes the results into a transmit probability 
memory 1324 found in the Per Packet Action module 132. 

FIG. 9 is a block diagram of a per packet action module 
132 for the egress flow control. Its key features and method 
of operation are similar to that which was described for FIG. 7. 

Accordingly, the ingress system 102, egress system 106 
and switch fabric 104 utilizing the plurality of congestion 
indicators as well as the coarse and fine adjustment modules 
cooperate to intelligently manage the system. 

CONCLUSION 

A method and system in accordance with the present 
invention provides for a unified method and system for 
logical connection of congestion with the appropriate flow 
control responses. The method and system utilize congestion 
indicators within the ingress system, egress system and the 
switch fabric in conjunction with a coarse adjustment system 
and fine adjustment system within the ingress device and the 
egress device to intelligently manage flows. Accordingly, a 
system and method in accordance with the present invention 
identifies a plurality of logical tests or definitions of 
congestion. A response to the congestion can be to discard all 
traffic, change the transmit rate, change the class of the 
packet, or log information about the packet. 

Although the present invention has been described in 
accordance with the embodiments shown, one of ordinary 
skill in the art will readily recognize that there could be 
variations to the embodiments and those variations would be 
within the spirit and scope of the present invention. 
Accordingly, many modifications may be made by one of 
ordinary skill in the art without departing from the spirit and 
scope of the appended claims. 

What is claimed is: 

1. A system for minimizing congestion of data packets in 
a communication system comprising: 

at least one ingress system, the ingress system including 
a first free queue, a first flow queue, a first congestion 
adjustment module for receiving congestion indications 
from the free queue and the flow queue, for generating 
and storing transmit probabilities and for performing 
per packet flow control actions; 

a switch fabric for receiving data packets from the ingress 
system and for providing a congestion indication to the 
ingress system; and 

at least one egress system for receiving the data from the 
switch fabric, the egress system including a second free 
queue; a second flow queue; a second congestion 
adjustment module for receiving congestion indications 
from the second free queue and the second flow queue, 
for generating and storing transmit probabilities and for 
performing per packet flow control actions, and a 
scheduler for adjusting data packets responsive to the 
first and second adjustment modules for determining 
the order and transmission of data packets out of the 
egress system. 

2. The system of claim 1 wherein each of the first and 
second congestion adjustment modules comprises: 

a coarse adjustment module for generating and storing 
transmit probabilities; and 

a fine adjustment module which is responsive to the 
coarse adjustment module for performing per packet 
flow control actions. 
3. The system of claim 2 wherein the coarse adjustment 
module comprises a per flow background update module. 

4. The system of claim 3 wherein the per flow background 
update module comprises a plurality of queue accounting 
blocks for receiving congestion indications and for providing 
current system measurements, and a transmit probability 
engine coupled to the plurality of queue accounting blocks. 

5. The system of claim 4 wherein the fine adjustment module 
comprises a per packet action module for receiving the 
control response. 

6. The system of claim 3 wherein the per packet module 
comprises a transmit probability memory for receiving current 
system measurements, packet classification information 
and transmit probability information, a comparator coupled 
to the transmit probability memory, a random number generator 
coupled to the comparator and a transmit block 
coupled to the comparator, for receiving the current system 
measurements and the packet classification information. 

7. The system of claim 5 wherein each of the ingress 
system and egress system includes at least one per bit rate 
module, the at least one per bit rate module for providing a 
congestion indication to its associated per packet action 
module. 

8. The system of claim 5 wherein the first free queue 
provides a congestion indication to its associated per packet 
action module. 

9. The system of claim 5 wherein the second free queue 
provides a congestion indication to its associated per packet 
action module. 

10. The system of claim 1 wherein the second free queue 
provides a congestion indication to the first free queue. 

11. The system of claim 1 wherein each of the ingress 
system, egress system and switch fabric include a memory 
coupled thereto. 

12. A system for minimizing congestion of data packets in 
a communication system comprising: 

at least one ingress system, the ingress system including 
a first free queue, a first flow queue, a first congestion 
adjustment module for receiving congestion indications 
from the free queue and the flow queue, for generating 
and storing transmit probabilities and for performing 
per packet flow control actions; 

a switch fabric for receiving data packets from the ingress 
system and for providing a congestion indication to the 
ingress system; and 

at least one egress system for receiving the data from the 
switch fabric, the egress system including a second free 
queue; a second flow queue; a second congestion 
adjustment module for receiving congestion indications 
from the second free queue and the second flow queue, 
for generating and storing transmit probabilities and for 
performing per packet flow control actions, and a 
scheduler for adjusting data packets responsive to the 
first and second adjustment modules for determining 
the order and transmission of data packets out of the 
egress system, wherein each of the first and second 
congestion adjustment modules comprises: 

a per flow background update module for generating 
and storing transmit probabilities; and 

a per packet action module which is responsive to the 
per flow background update module for performing 
per packet flow control actions. 

13. The system of claim 12 wherein the per flow background 
update module comprises a plurality of queue 
accounting blocks for receiving congestion indications and 
for providing current system measurements, and a transmit 
probability engine coupled to the plurality of queue accounting 
blocks. 

14. The system of claim 13 wherein the per packet module 
comprises a transmit probability memory for receiving current 
system measurements, packet classification information 
and transmit probability information, a comparator coupled 
to the transmit probability memory, a random number generator 
coupled to the comparator and a transmit block 
coupled to the comparator, for receiving the current system 
measurements and the packet classification information. 

15. The system of claim 14 wherein each of the ingress 
system and egress system includes at least one per bit rate 
module, the at least one per bit rate module for providing a 
congestion indication to its associated per packet action 
module. 

16. The system of claim 14 wherein the first free queue 
provides a congestion indication to its associated per packet 
action module. 

17. The system of claim 14 wherein the second free queue 
provides a congestion indication to its associated per packet 
action module. 

18. The system of claim 12 wherein the second free queue 
provides a congestion indication to the first free queue. 

19. The system of claim 12 wherein each of the ingress 
system, egress system and switch fabric include a memory 
coupled thereto. 

* * * * * 